Abstract

We describe an incremental unsupervised 
procedure to learn words from transcribed 
continuous speech. The algorithm is based 
on a conservative and traditional statistical 
model, and results of empirical tests show 
that it is competitive with other algorithms 
that have been proposed recently for this 
task. 



1. Introduction 

English speech lacks the acoustic analog of blank 
spaces that people are accustomed to seeing between
words in written text. Discovering words in continuous
speech is therefore an interesting problem that
has been treated at length in the literature. The issue 
is also particularly prominent in the parsing of written 
text in languages that do not explicitly include spaces 
between words. 

In this paper, we describe an incremental unsuper- 
vised algorithm based on a formal statistical model 
to infer word boundaries from continuous speech. The 
main contributions of this study are as follows: First, 
it demonstrates the applicability and competitiveness 
of a conservative traditional approach for a task for 
which nontraditional approaches have been proposed 
even recently (?; ?; ?; ?; ?). Second, although the 
model leads to the development of an algorithm that 
learns the lexicon in an unsupervised fashion, results 
of partial supervision are also presented, showing that 
its performance is consistent with results from learning 
theory. 

2. Related Work 

While there exists a reasonable body of literature with 
regard to word discovery and text segmentation, especially
with respect to languages such as Chinese and
Japanese, which do not explicitly include spaces between
words, most of the statistically based models
and algorithms tend to fall into the supervised learning
category. These require the model to first be trained
on a large corpus of text before it can segment its
input.^1 It is only of late that interest in unsupervised
algorithms for text segmentation seems to have
gained ground. In the last ANLP/NAACL joint lan- 
guage technology conference, ? (?) proposed an al- 
gorithm to infer word boundaries from character n- 
gram statistics of Japanese Kanji strings. For exam- 
ple, a decision to insert a word boundary between two 
characters is made solely based on whether charac- 
ter n-grams adjacent to the proposed boundary are 
relatively more frequent than character n-grams that 
straddle it. However, even this algorithm is not truly 
unsupervised. There is a threshold parameter involved 
that must be tuned in order to get optimal segmenta- 
tions when single character words are present. Also, 
the number of orders of n-grams that are significant in 
the segmentation decision making process is a tunable 
parameter. The authors state that these parameters 
can be set with a very small number of pre-segmented 
training examples, as a consequence of which they call 
their algorithm mostly unsupervised. A further factor 
contributing to the incommensurability of their algo- 
rithm with our approach is that it is not immediately 
obvious how to adapt their algorithm to operate in- 
crementally. Their procedure is more suited to batch 
segmentation, where corpus n-gram statistics can be 
obtained during a first pass and segmentation deci- 
sions made during the second. Our algorithm, how- 
ever, is purely incremental and unsupervised and does 
not need to make multiple passes over the data, nor re- 
quire tunable parameters to be set from training data 
beforehand. In this respect, it is most similar to Model 
Based Dynamic Programming, hereafter referred to as 
MBDP-1, which has been proposed in (?). To the 

^1 See, for example, ? (?) for a survey and ? (?) for the
most recent such approach.



author's knowledge, MBDP-1 is probably the most 
recent and only other completely unsupervised work 
that attempts to discover word boundaries from un- 
segmented speech data. Both the approach presented 
in this paper and MBDP-1 are based on explicit prob- 
ability models. As the name implies, MBDP-1 uses 
dynamic programming to infer the best segmentation 
of the input corpus. It is assumed that the entire input 
corpus, consisting of a concatenation of all utterances 
in sequence, is a single event in probability space and 
that the best segmentation of each utterance is implied 
by the best segmentation of the corpus itself. The 
model thus focuses on explicitly calculating probabili- 
ties for every possible segmentation of the entire cor- 
pus, subsequently picking the segmentation with the 
maximum probability. More precisely, the model at- 
tempts to calculate 

$$P(\bar{w}_m) = \sum_{n} \sum_{L} \sum_{f} \sum_{s} P(\bar{w}_m \mid n, L, f, s) \cdot P(n, L, f, s)$$

for each possible segmentation of the input corpus,
where the left-hand side is the exact probability of
that particular segmentation of the corpus into words
$\bar{w}_m = w_1 w_2 \cdots w_m$ and the sums are over all possible
numbers of words, $n$, in the lexicon, all possible lexicons,
$L$, all possible frequencies, $f$, of the individual
words in this lexicon and all possible orders of words, $s$,
in the segmentation. In practice, the implementation
uses an incremental approach that computes the best
segmentation of the entire corpus up to step $i$, where
the $i$th step is the corpus up to and including the $i$th
utterance. Incremental performance is thus obtained
by computing this quantity anew after each segmentation
$i$, assuming, however, that segmentations of
utterances up to but not including $i$ are fixed. Thus,
although the segmentation algorithm itself is incre- 
mental, the formal statistical model of segmentation 
is not. 

Furthermore, making the assumption that the corpus 
is a single event in probability space significantly in- 
creases the computational complexity of the incremen- 
tal algorithm. The approach presented in this pa- 
per circumvents these problems through the use of a 
conservative statistical model that is directly implementable
as an incremental algorithm. In the following
sections, we describe the model and the algorithm
derived from it. The technique can basically be seen
as a modification of Brent's work, borrowing in particular
his successful dynamic programming approach
while substituting his statistical model with a more 
conservative one. 



3. Model Description 

The language model described here is fairly standard 
in nature. The interested reader is referred to ?, (?, 
p. 57-78), where a detailed exposition can be found. 
Basically, we seek the most probable string of words,
where $W = w_1, \ldots, w_n$ denotes a particular string of
$n$ words. Each word is assumed to be made up of
a finite sequence of characters representing phonemes
from a finite inventory.
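
The display equation that followed "we seek" is not present in this copy. As a hedged reconstruction only, assuming it is the usual maximum-probability decoding objective expanded by the chain rule (the symbol $\hat{W}$ is introduced here for illustration), it would have the following form:

```latex
\hat{W} = \arg\max_{W} P(W)
        = \arg\max_{W} \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
        = \arg\min_{W} \sum_{i=1}^{n} -\log P(w_i \mid w_1, \ldots, w_{i-1})
```

The last form is the one that matches the negative log scores minimized by the algorithm in Section 5.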

We make the unigram approximation that word histories
are irrelevant to their probabilities. This allows
us to rewrite the right-hand side of the preceding equation
in terms of unconditional probabilities. We also employ back-off (?)
using the Witten-Bell technique (?) when novel words
are encountered. This enables us to use an open vocabulary
and estimate familiar word probabilities from
their relative frequencies in the observed corpus while
backing off to the letter level for novel words. In our
case, a novel word is decomposed into its constituent
phonemes and its probability is then calculated as the
normalized product of its phoneme probabilities. To
do this, we introduce the sentinel phoneme '#', which
is assumed to terminate every word. The model can
now be summarized very simply as follows:
where $C()$ denotes the count or frequency function,
$N$ denotes the number of distinct words in the word
table, $S$ denotes the sum of their frequencies, $|w|$ denotes
the length of word $w$, excluding the sentinel '#',
$w[j]$ denotes its $j$th phoneme, and $r()$ denotes the relative
frequency function. The normalization by dividing
by $1 - r(\#)$ in the novel-word case is necessary because
otherwise

$$\sum_{w} P(w) = \sum_{i=1}^{\infty} (1 - P(\#))^{i} \, P(\#) \qquad (6)$$

which sums to $1 - P(\#)$ rather than to 1. Since we estimate
$P(w[j])$ by $r(w[j])$, dividing by $1 - r(\#)$ will ensure that
$\sum_{w} P(w) = 1$.
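
The display that summarized the model is not present in this copy. A form consistent with the symbol definitions above and with the evalWord function of Figure 2 is sketched below; it is a reconstruction under those assumptions, and the name $P_{\Sigma}$ for the phoneme-level distribution is introduced here purely for illustration:

```latex
P(w) =
\begin{cases}
  \dfrac{C(w)}{N + S} & \text{if } C(w) > 0 \\[2ex]
  \dfrac{N}{N + S} \, P_{\Sigma}(w) & \text{otherwise}
\end{cases}
\qquad
P_{\Sigma}(w) = \frac{r(\#)}{1 - r(\#)} \prod_{j=1}^{|w|} r(w[j])
```

Here $N/(N+S)$ plays the role of the Witten-Bell escape probability, and the factor $1/(1 - r(\#))$ is the normalization discussed above.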

4. Method 

As in ? (?), the model described in Section 3 is presented
as an incremental learner. The only knowledge
built into the system at start-up is the phoneme
table with a uniform distribution over all phonemes, 
including the sentinel phoneme. The learning algo- 
rithm considers each utterance in turn and computes 
the most probable segmentation of the utterance us- 
ing a Viterbi search (?) implemented as a dynamic 
programming algorithm described shortly. The most 
likely placement of word boundaries computed thus is 
committed to before considering the next presented ut- 
terance. Committing to a segmentation involves learn- 
ing word probabilities as well as phoneme probabilities 
from the inferred words. These are used to update 
their respective tables. To account for effects that any 
specific ordering of input utterances may have on the 
segmentations that are output, the performance of the 
algorithm is averaged over 1000 runs, with each run 
receiving as input a random permutation of the input 
corpus. 

The input corpus 

The corpus, which is identical to the one used by ? 
(?), consists of orthographic transcripts made by ? 
(?) from the CHILDES collection (?). The speak- 
ers in this study were nine mothers speaking freely to 
their children, whose ages averaged 18 months (range 
13-21). Brent and his colleagues also transcribed the 
corpus phonemically (using an ASCII phonemic rep- 
resentation), ensuring that the number of subjective 
judgments in the pronunciation of words was mini- 
mized by transcribing every occurrence of the same 
word identically. For example, "look", "drink" and
"doggie" were always transcribed "lUk", "drINk" and
"dOgi" regardless of where in the utterance they occurred
and which mother uttered them in what way.
Thus transcribed, the corpus consists of a total of 9790 
such utterances and 33,399 words including one space 
after each word and one newline after each utterance. 

It is noteworthy that the choice of this particular cor- 
pus for experimentation is motivated purely by its use 
in ? (?). The algorithm is equally applicable to plain 
text in English or other languages. The main advan- 
tage of the CHILDES corpus is that it allows for ready 
and quick comparison with results hitherto obtained 
and reported in the literature. Indeed, the relative 
performance of all the discussed algorithms is mostly 
unchanged when tested on the 1997 Switchboard telephone
speech corpus with disfluency events removed.

5. Algorithm

The dynamic programming algorithm finds the most 
probable word sequence for each input utterance by 
assigning to each segmentation a score equal to the 
logarithm of its probability and committing to the segmentation
with the highest score. In practice, the implementation
computes the negative of this
score and thus commits to the segmentation with the
smallest negative log probability. The algorithm
is presented in recursive form in Figure 1 for
readability. The actual implementation, however, used
an iterative version. The algorithm to evaluate the
back-off probability of a word is given in Figure 2. Essentially,
the algorithm description can be summed up
semiformally as follows: For each input utterance $u$,
which has either been read in without spaces, or from
which spaces have been deleted, we evaluate every possible
way of segmenting it as $u = u' + w$, where $u'$ is
a subutterance from the beginning of the original utterance
up to some point within it and $w$, the lexical
difference between $u$ and $u'$, is treated as a word. The
subutterance $u'$ is itself evaluated recursively using the
same algorithm. The base case for recursion when
the algorithm rewinds is obtained when a subutterance
cannot be split further into a smaller component
subutterance and word, that is, when its length is zero.
Suppose, for example, that a given utterance is abcde,
where the letters represent phonemes. If seg(x) represents
the best segmentation of the utterance x and
word(x) denotes that x is treated as a word, then



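$$
\mathrm{seg}(abcde) = \text{best of}
\begin{cases}
\mathrm{word}(abcde) \\
\mathrm{seg}(a) + \mathrm{word}(bcde) \\
\mathrm{seg}(ab) + \mathrm{word}(cde) \\
\mathrm{seg}(abc) + \mathrm{word}(de) \\
\mathrm{seg}(abcd) + \mathrm{word}(e)
\end{cases}
$$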


The evalUtterance algorithm in Figure 1 does precisely
this. It initially assumes the entire input utterance
to be a word on its own by assuming a single
segmentation point at its right end. It then compares
the log probability of this segmentation successively to
the log probabilities of segmenting it into all possible
(subutterance, word) pairs. Once the best segmentation
into words has been found, spaces are inserted
into the utterance at the inferred points and the segmented
utterance is printed out.

The implementation maintains two separate tables in- 
ternally, one for words and one for phonemes. When 
the procedure is initially started, the word table is 
empty. Only the phoneme table is populated with
equiprobable phonemes. As the program considers each
utterance in turn and commits to its best segmentation
according to the evalUtterance algorithm, the
two tables are updated correspondingly. For example,
after some utterance "abcde" is segmented into "a bc
de", the word table is updated to increment the frequencies
of the three entries "a", "bc" and "de" each
by 1, and the phoneme table is updated to increment
the frequencies of each of the phonemes in the utter- 
ance including one sentinel for each word inferred. Of 
course, incrementing the frequency of a currently un- 
known word is equivalent to creating a new entry for 
it with frequency 1. 
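
The table update just described can be sketched in a few lines of Python. This is only an illustration: the function name `commit`, the use of `collections.Counter` for both tables, and the single-character representation of phonemes are assumptions, not details given in the paper.

```python
from collections import Counter

SENTINEL = "#"  # word-terminating phoneme, as in Section 3


def commit(words, word_counts, phoneme_counts):
    """Update both tables in place from one committed segmentation.

    Assumes each phoneme is a single character of the word string.
    """
    for w in words:
        # A word not yet in the table implicitly has count 0,
        # so this creates its entry with frequency 1.
        word_counts[w] += 1
        # One count for every phoneme in the word ...
        phoneme_counts.update(w)
        # ... plus one sentinel for each inferred word.
        phoneme_counts[SENTINEL] += 1
```

For the example in the text, calling `commit(["a", "bc", "de"], word_counts, phoneme_counts)` adds three word entries, five phoneme counts and three sentinel counts.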

5.1 Algorithm: evalUtterance 

BEGIN
   Input (by ref) utterance u[0..n]
      where u[i] are the characters in it.

   bestSegpoint := n;
   bestScore := evalWord(u[0..n]);
   for i from 0 to n-1; do
      subUtterance := copy(u[0..i]);
      word := copy(u[i+1..n]);
      score := evalUtterance(subUtterance)
               + evalWord(word);
      if (score < bestScore); then
         bestScore := score;
         bestSegpoint := i;
      fi
   done
   insertWordBoundary(u, bestSegpoint);
   return bestScore;
END

Figure 1. Recursive optimization algorithm to find the best 
segmentation of an input utterance using the language 
model described in this paper. 
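
The text notes that the actual implementation was iterative but does not list it. One possible iterative form of the same optimization is sketched below in Python. It is not the author's code: the names `segment`, `best` and `back` are illustrative, and the word cost is passed in as a callable `eval_word` returning $-\log P(w)$ so that the sketch stays self-contained.

```python
def segment(utterance, eval_word):
    """Return (best_score, words) minimizing the summed word costs.

    eval_word(word) is assumed to return -log P(word).
    """
    n = len(utterance)
    # best[j]: lowest cost of any segmentation of utterance[:j]
    best = [0.0] + [float("inf")] * n
    # back[j]: start index of the last word in that segmentation
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + eval_word(utterance[i:j])
            if score < best[j]:
                best[j], back[j] = score, i
    # Recover the words by following the back-pointers from the end.
    words = []
    j = n
    while j > 0:
        words.append(utterance[back[j]:j])
        j = back[j]
    words.reverse()
    return best[n], words
```

Each prefix is scored once, so an utterance of length n needs on the order of n^2 word evaluations, which agrees with the running time discussed next.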

One can easily see that the running time of the program
is $O(mn^2)$ in the total number of utterances ($m$)
and the length of each utterance ($n$), assuming an efficient
implementation of a hash table allowing nearly
constant lookup time is available. A single run over the
entire corpus typically completes in under 10 seconds
on a 300 MHz i686-based PC running Linux 2.2.5-15.
Although all the discussed algorithms tend to complete
within one minute on the reported corpus, MBDP-1's
running time is quadratic in the number of utterances,
while the language model presented here enables computation
in almost linear time. The typical running
time of MBDP-1 on the 9790-utterance corpus averages
around 40 seconds per run on a 300 MHz i686
PC while the algorithm described in this paper averages
around 7 seconds.



5.2 Function: evalWord 

BEGIN
   Input (by reference) word w[0..k]
      where w[i] are the phonemes in it.

   score := 0;
   N := number of distinct words;
   S := sum of their frequencies;
   if freq(w) == 0; then {
      escape := N/(N+S);
      P_0 := relativeFrequency('#');
      score := -log(escape) - log(P_0/(1-P_0));
      for each w[i]; do
         score -= log(relativeFrequency(w[i]));
      done
   } else {
      P_w := frequency(w)/(N+S);
      score := -log(P_w);
   }
   return score;
END

Figure 2. The function to compute $-\log P(w)$ of an input
word w. If the word is novel, then the function backs off 
to using a distribution over the phonemes in the word. 
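
A runnable Python rendering of Figure 2 might look as follows. It is a sketch under stated assumptions: both tables are `collections.Counter` objects, phonemes are single characters, and the escape probability is taken to be 1 when the word table is empty (N = S = 0), a case the figure does not spell out.

```python
import math
from collections import Counter

SENTINEL = "#"


def eval_word(word, word_counts, phoneme_counts):
    """Return -log P(word) under the unigram model with back-off."""
    n_types = len(word_counts)             # N: number of distinct words
    n_tokens = sum(word_counts.values())   # S: sum of their frequencies
    freq = word_counts.get(word, 0)
    if freq > 0:
        # Familiar word: relative frequency discounted by N + S.
        return -math.log(freq / (n_types + n_tokens))

    # Novel word: escape probability times normalized phoneme product.
    # Assumption: with an empty word table the escape probability is 1.
    escape = n_types / (n_types + n_tokens) if n_types else 1.0
    total = sum(phoneme_counts.values())

    def r(phoneme):
        # Relative frequency of a phoneme in the phoneme table.
        return phoneme_counts[phoneme] / total

    p0 = r(SENTINEL)
    score = -math.log(escape) - math.log(p0 / (1.0 - p0))
    for phoneme in word:
        score -= math.log(r(phoneme))
    return score
```

The sketch raises a `ValueError` for a phoneme with zero count, so the phoneme table is expected to cover the whole inventory, as it does when initialized uniformly.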

6. Results and Discussion 

In line with the results reported in ? (?), three scores 
were calculated — precision, recall and lexicon pre- 
cision. Precision is defined as the proportion of pre- 
dicted words that are actually correct. Recall is de- 
fined as the proportion of correct words that were 
predicted. Lexicon precision is defined as the propor- 
tion of words in the predicted lexicon that are correct. 
Precision and recall scores were computed incremen- 
tally and cumulatively within scoring blocks, each of 
which consisted of 100 utterances. We emphasize that 
the segmentation itself proceeded incrementally, on an 
utterance-by-utterance basis. Only the scores are re- 
ported on a per-block basis for brevity. These scores 
were computed and averaged only for the utterances 
within each block scored and thus they represent the 
performance of the algorithm on the block of utter- 
ances scored, occurring in the exact context among 
the other scoring blocks. Lexicon scores carried over 
blocks cumulatively. As Figures 3 through 5 show, the
performance of our algorithm matches that of MBDP-1
on all grounds. In fact, we found to our surprise
that the performances of both algorithms were almost 
identical except in a few instances, discussion of which 
space does not permit here. 
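
The three scores can be illustrated with a short Python sketch. It assumes that a predicted word counts as correct when it covers exactly the same character span as a word in the reference segmentation, and that a lexicon entry is correct when it also occurs in the reference lexicon; the paper states the definitions only in words, so this matching rule and the function names are assumptions.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def precision_recall(predicted, gold):
    """predicted, gold: lists of utterances, each a list of words."""
    correct = n_pred = n_gold = 0
    for p, g in zip(predicted, gold):
        ps, gs = spans(p), spans(g)
        correct += len(ps & gs)   # words with both boundaries right
        n_pred += len(ps)
        n_gold += len(gs)
    return correct / n_pred, correct / n_gold


def lexicon_precision(predicted, gold):
    """Proportion of distinct predicted words found in the gold lexicon."""
    pred_lex = {w for utt in predicted for w in utt}
    gold_lex = {w for utt in gold for w in utt}
    return len(pred_lex & gold_lex) / len(pred_lex)
```

For a single utterance predicted as "abc de" against the reference "a bc de", only "de" is correct, giving precision 1/2 and recall 1/3.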

Figure 3. Word discovery precision as a function of number
of utterances considered. Each scoring block (checkpoint)
consists of 10% of the total number of utterances (roughly
1000). It is hard to discern two separate plots above because
of the close match in their performance. 1-gram denotes
the performance of the procedure reported in this
paper whereas MBDP denotes the performance of Brent's
Model Based Dynamic Programming algorithm.

This leads us to suspect that the two substantially different
statistical models may essentially be capturing the
same nuances of the domain. Although ? (?) explicitly
states that probabilities are not estimated for
words, it turns out that considering the entire corpus
as a single event in probability space does end up having
the same effect as estimating probabilities from
relative frequencies as our statistical model does. The
relative probability of a familiar word is given in Equation
22 of ? (?) as

$$\frac{f_k(k)}{k} \left( \frac{f_k(k) - 1}{f_k(k)} \right)^{2}$$

where $k$ is the total number of words and $f_k(k)$ is the
frequency at that point in segmentation of the $k$th
word. It effectively approximates to the relative frequency

$$\frac{f_k(k)}{k}$$

as $f_k(k)$ grows. The language model presented in this
paper explicitly claims to use this specific estimator 
for the word probabilities. From this perspective, both 
MBDP-1 and the present model tend to favor the seg- 
menting out of familiar words that do not overlap. In 
this context, we are curious to see how the algorithms 
would fare if in fact the utterances were favorably or- 
dered, that is, in order of increasing length. Clearly, 
this is an important advantage for both algorithms. 
The results of experimenting with a generalization of 
this situation, where instead of ordering the utterances 
favorably, we treat an initial portion of the corpus as 
a training component effectively giving the algorithms 
free word boundaries after each word, are presented in 
Section 7.

Figure 4. Word discovery recall as a function of number of
utterances considered.

Figure 5. Lexicon precision (percentage of correctly inferred
words in the lexicon) as a function of number of
utterances considered.



In contrast with MBDP-1, we note that the model proposed
in this paper has been entirely developed along
conventional lines and has not made the somewhat
radical assumption of treating the entire observed corpus
as a single event in probability space. Assuming
that the corpus consists of a single event requires the
explicit calculation of the probability of the lexicon in
order to calculate the probability of any single segmentation.
This calculation is a nontrivial task since one
has to sum over all possible orders of words in the lexicon,
$L$. This fact is recognized in ? (?), where the
expression for $P(L)$ is derived in Appendix 1 of his paper
as an approximation. One can imagine then that
it will be correspondingly more difficult to extend the
language model in ? (?) past the case of unigrams.
As a practical issue, recalculating lexicon probabilities
before each segmentation also increases the running
time of an implementation of the algorithm.

Furthermore, the language model presented in this paper
estimates probabilities as relative frequencies using
the commonly used back-off procedure and so does
not assume any priors over integers. However, MBDP-1
requires the assumption of two distributions over integers,
one to pick a number for the size of the lexicon
and another to pick a frequency for each word in the
lexicon. Each is assumed to follow a fixed prior $P(i)$
over the integers $i$. We have since
found some evidence suggesting that the choice of a
particular prior does not have any significant advantage
over the choice of any other prior. For example,
we have tried running MBDP-1 with a different choice
of $P(i)$ and still obtained comparable results. It is noteworthy,
however, that no such subjective prior needs to be
chosen in the model presented in this paper.

The other important difference between MBDP-1 and 
the present model is that MBDP-1 assumes a uniform 
distribution over all possible word orders and explicitly 
derives the probability expression for any particular 
ordering. That is, in a corpus that contains $n_k$ distinct
words such that the frequency in the corpus of the $i$th
distinct word is given by $f_k(i)$, the probability of any
one ordering of the words in the corpus is

$$\frac{\prod_{i=1}^{n_k} f_k(i)!}{k!}$$


because the number of unique orderings is precisely the 
reciprocal of the above quantity. In contrast, this in- 
dependence assumption is already implicit in the uni- 
gram language model adopted in the present approach. 
Brent mentions that there may well be efficient ways
of using n-gram distributions within MBDP-1. How- 
ever, the framework presented in this paper is a formal 
statement of a model that lends itself to such easy n- 
gram extensibility using the back-off scheme proposed. 
It is now a simple matter to include bigrams and tri- 
grams among the tables being learned. Since back-off 
has already been incorporated into the model, we sim- 
ply substitute for the probability expression of a word 
(which currently uses no history), the probability ex- 
pression given its immediate history (typically n — 1 
words). Thus, we use a back-off expression of the usual
form, which estimates the probability from counts when
the word has been seen with its history and falls back to
a weighted estimate with reduced history otherwise.
Here $P(w \mid h)$ denotes the probability of word $w$ conditioned
on its history $h$, normally the immediately previous
1 (for bigrams) or 2 (for trigrams) words, $\alpha$ is the
back-off weight or discount factor, which we may calculate
using any of a number of standard techniques,
for example by using the Witten-Bell technique as we
have done in this paper, $C()$ denotes the count or frequency
function of its argument in its respective table,
and $h'$ denotes reduced history, usually by one word.
Reports of experiments with such extensions can, in 
fact, be found in a forthcoming article (?). 
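
The display for this n-gram expression is not present in this copy. A generic back-off form that fits the symbol definitions above is sketched below. The exact placement of the weight $\alpha$ is an assumption, chosen so that the unigram model of Section 3 is recovered when $h$ is empty and $\alpha = N/(N+S)$:

```latex
P(w \mid h) =
\begin{cases}
  (1 - \alpha) \, \dfrac{C(h, w)}{C(h)} & \text{if } C(h, w) > 0 \\[2ex]
  \alpha \, P(w \mid h') & \text{otherwise}
\end{cases}
```

With an empty history, $C(h) = S$ and $(1 - \alpha) \, C(w)/S = C(w)/(N + S)$, which is the familiar-word estimate used by evalWord.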

7. Training 

Although we have presented the algorithm as an unsu- 
pervised learner, it is interesting to compare its respon- 
siveness to the effect of training data. Here, we extend 
the work in ? (?) by reporting the effect of training 
upon the performance of both algorithms. Figures 6
and 7 plot the results (precision and recall) over the
whole input corpus, that is, blocksize = $\infty$, as a function
of the initial proportion of the corpus reserved for
training. This is done by dividing the corpus into two 
segments, with an initial training segment being used 
by the algorithm to learn word and phoneme probabil- 
ities and the latter actually being used as the test data. 
A consequence of this is that the amount of data avail- 
able for testing becomes progressively smaller as the 
percentage reserved for training grows. So, the signifi- 
cance of the test would diminish correspondingly. We 
may assume that the plots cease to be meaningful and 
interpretable when more than about 75% (about 7500 
utterances) of the corpus is used for training. At 0%, 
there is no training information for any algorithm, and 
the performances of the various algorithms are identi- 
cal to those of the unsupervised case. We increase the 
amount of training data in steps of approximately 1% 
(100 utterances). For each training set size, the results 
reported are averaged over 25 runs of the experiment, 
each over a separate random permutation of the cor- 
pus. The motivation was both to account for ordering 
idiosyncrasies and to smooth the graphs to make them 
easier to interpret. 
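
The split used in this experiment can be sketched as follows. The function name `split_for_training`, the `seed` parameter and the use of Python's `random` module are illustrative assumptions; the paper specifies only that each run uses a random permutation of the corpus with an initial portion reserved for training.

```python
import random


def split_for_training(utterances, train_fraction, seed=0):
    """Permute the corpus and reserve an initial portion for training."""
    shuffled = list(utterances)
    # A separate random permutation per seed, as in each run of the experiment.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    # The first segment seeds the tables; the rest is the test data.
    return shuffled[:cut], shuffled[cut:]
```

Averaging the scores over several seeds for each `train_fraction` corresponds to the 25 runs per training set size described above.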

We interpret Figures 6 and 7 as suggesting that the
performance of all the discussed algorithms can be
boosted significantly with even a small amount of
training. It is also noteworthy and reassuring to see
that, as one would expect from results in computational
learning theory (?), the number of training examples
required to obtain a desired value of precision,
$p$, appears to grow with $1/(1 - p)$.

Significance of single word utterances 

The results we have obtained provide some insight into
the actual learning process, which appears to be one
in which rapid bootstrapping happens with very limited
data. As we had remarked earlier, the word table
is initially empty. Thus, the very first utterance
is necessarily segmented as a single novel word.
The reason that fewer novel words are preferred initially
is this: Since the word table is empty when
the algorithm attempts to segment the first utterance,
backing off causes all probabilities to necessarily be
computed from the level of phonemes up. Thus, the
more words a candidate segmentation contains, the more
sentinel characters will be included in the probability
calculation and so the lower will be the corresponding
segmentation probability. As the program works its way through
the corpus, correctly inferred words, by virtue of their
relatively greater preponderance compared to noise,
tend to dominate the distributions and thus dictate
how future utterances are segmented.

Figure 6. Responsiveness of the algorithm to training information.
The horizontal axis represents the initial percentage
of the data corpus that was used for training the
algorithm. This graph shows the improvement in segmentation
precision with training size.

From this point of view, we see that the presence of
single word utterances is of paramount importance to
the algorithm. Fortunately, very few such utterances
suffice for good performance, for every correctly inferred
word helps in the inference of other words that
are adjacent to it. This is the role played by training,
whose primary use can now be said to be in supplying
the word table with seed words. We can now further
refine our statement about the importance of single
word utterances. Although single word utterances are
important for the learning task, what are critically important
are words that occur both by themselves in
an utterance and in the context of other words after
they are first seen. This brings up a potentially interesting
issue. Suppose disfluencies in speech can be
interpreted, in some sense, as free word boundaries.
We are then interested in whether their distribution
in speech is high enough in the vicinity of generally
frequent words. If that is the case, then ums and ahs
are potentially useful from a cognitive point of view to
a person acquiring a lexicon since these are the very
events that will bootstrap his or her lexicon with the
initial seed words that are instrumental in the rapid
acquisition of further words.

Figure 7. Improvement in segmentation recall with training
size.

8. Summary 

In summary, we have presented a formal model of word 
discovery in speech transcriptions. The main advantages
of this model over that of ? (?) are, first, that
the present model has been developed entirely by di- 
rect application of standard techniques and procedures 
in speech processing. Second, the model is easily ex- 
tensible to incorporate more historical detail in the 
usual way. Third, the presented model makes few 
assumptions about the nature of the domain and re- 
mains as far as possible conservative and simple in its 
development. Results from experiments suggest that 
the algorithm performs competitively with other unsupervised
techniques recently proposed for inferring
words from transcribed speech. Finally, although the 
algorithm is originally presented as an unsupervised 
learner, we have shown the effect that training data 
has on its performance. 

Future work 

Other extensions being worked on include the incor- 
poration of more complex phoneme distributions into 
the model. These are, namely, the biphone and tri- 
phone models. Using the lead from ? (?), attempts to 
model more complex distributions for words such as 
those based on template grammars and the systematic 
incorporation of prosodic, stress and phonotactic constraint
information into the model are also the subject
of current interest. We already have some unpublished
results suggesting that biasing the segmentation using 
a constraint that every word must have at least one 
vowel in it dramatically increases segmentation preci- 
sion from 67.7% to 81.8%, and imposing a constraint 
that words can begin or end only with permitted clus- 
ters of consonants increases precision to 80.65%. 

Another avenue of current research is concerned with 
iterative sharpening of the language model wherein 
word probabilities are periodically reestimated using 
a fixed number of iterations of the Expectation Maximization
(EM) algorithm (?). Such reestimation has
been found to improve the performance of language 
models in other similar tasks. It has also been sug- 
gested that the algorithm could be usefully adapted to 
user modeling in human-computer interaction, where 
the task lies in predicting the most likely atomic ac- 
tion a computer user will perform next. However, we 
have as yet no results or work to report on in this area. 